Add CSI registry allow list check to admission webhook#48011
Add CSI registry allow list check to admission webhook#48011
Conversation
When using CSI-based library injection, the admission webhook now checks the registry allow list before adding CSI volumes to pods. If a library's registry is not in the allow list, the webhook skips adding the CSI volume entirely. This provides defense in depth alongside the CSI driver's own registry check — the webhook gives clean UX (no unnecessary volumes), while the CSI driver acts as a security backstop. The config key admission_controller.auto_instrumentation.csi_registry_allow_list is set via the Helm chart from the same value used for the CSI driver's DD_REGISTRY_ALLOW_LIST.
Files inventory check summaryFile checks results against ancestor 695e39f7: Results for datadog-agent_7.79.0~devel.git.873.6f02d12.pipeline.108181075-1_amd64.deb:No change detected |
Static quality checks✅ Please find below the results from static quality gates Successful checksInfo
22 successful checks with minimal change (< 2 KiB)
On-wire sizes (compressed)
|
6dfd81c to
321c129
Compare
Use ParseEnvAsStringSlice (the established pattern in the codebase) to register an env var transformer that splits comma-separated values. This replaces the ad-hoc splitStringSlice wrapper and is consistent with how other string slice configs handle the Viper limitation (e.g. apm_config.features, apm_config.ignore_resources).
321c129 to
f46dad4
Compare
Regression DetectorRegression Detector ResultsMetrics dashboard Baseline: 8b782fd Optimization Goals: ✅ No significant changes detected
|
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ❌ | docker_containers_cpu | % cpu utilization | +6.84 | [+3.75, +9.93] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ❌ | docker_containers_cpu | % cpu utilization | +6.84 | [+3.75, +9.93] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | +0.73 | [+0.58, +0.88] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | +0.70 | [+0.46, +0.94] | 1 | Logs bounds checks dashboard |
| ➖ | otlp_ingest_logs | memory utilization | +0.49 | [+0.38, +0.59] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | +0.30 | [-1.36, +1.96] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.29 | [+0.25, +0.32] | 1 | Logs bounds checks dashboard |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | +0.18 | [+0.01, +0.34] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | +0.08 | [-0.43, +0.59] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.01 | [-0.10, +0.13] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | +0.01 | [-0.19, +0.21] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.01 | [-0.21, +0.22] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | -0.02 | [-0.20, +0.17] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.02 | [-0.45, +0.41] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | -0.03 | [-0.43, +0.37] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.03 | [-0.15, +0.09] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | -0.04 | [-0.09, +0.00] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics | memory utilization | -0.06 | [-0.25, +0.13] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | -0.12 | [-0.34, +0.11] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | -0.28 | [-0.43, -0.13] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | -0.44 | [-0.53, -0.36] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.48 | [-0.55, -0.42] | 1 | Logs |
| ➖ | file_tree | memory utilization | -0.50 | [-0.55, -0.44] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | -0.57 | [-0.64, -0.50] | 1 | Logs |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 677 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 275.78MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 714 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.24GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.20GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.22GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 4 = 4 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 174.94MiB ≤ 181MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 4 = 4 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 499.40MiB ≤ 550MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 210.89MiB ≤ 220MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 366.01 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 406.68MiB ≤ 475MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
The allow-list check now lives in the admission webhook (mutator.go) and covers all injection modes (init-container, image-volume, CSI), not just CSI. Rename the config key accordingly.
Test three cases: - Empty allow list permits injection from any registry (backward compat) - Registry in allow list permits injection - Registry not in allow list blocks injection and sets error annotation
…-allow-list-webhook
Replace global.containerRegistryAllowList approach with setting DD_ADMISSION_CONTROLLER_AUTO_INSTRUMENTATION_CONTAINER_REGISTRY_ALLOW_LIST directly via clusterAgent.env, so the test works with the current helm chart version without needing the helm-charts PR to merge first. Two test cases: - TestRegistryAllowListBlocked: allow list = fake.registry.invalid, injection is blocked and error annotation is set - TestRegistryAllowListAllowed: allow list = registry.datadoghq.com (the injector's registry), injection proceeds and traces arrive
Deploy two apps in the same cluster with allow list = registry.datadoghq.com: - registry-allow-list-allowed: uses default injector, injection proceeds - registry-allow-list-blocked: pod annotation overrides injector image to fake.registry.invalid (not in allow list), injection is blocked This avoids a second UpdateEnv call (and second cluster setup) by using the admission.datadoghq.com/apm-inject.custom-image annotation to point one pod at an injector registry that is not in the allow list.
The previous check only validated the injector image registry. A user could bypass the allow list by annotating a pod with admission.datadoghq.com/python-lib.custom-image pointing to an arbitrary registry. Now InjectAPMLibraries checks every library's registry against the allow list in addition to the injector's registry. Adds a unit test for the library registry case and a third e2e scenario (LibraryRegistryBlockedByAllowList) that uses a python-lib.custom-image annotation pointing to fake.registry.invalid to verify the check fires.
Two bugs: 1. kpiEnvVarsMutator (which sets DD_INSTRUMENTATION_INSTALL_TYPE) runs before apmInjectionMutator (which contains the allow list check in InjectAPMLibraries). Blocked pods were getting DD_INSTRUMENTATION_INSTALL_TYPE set even though injection was blocked, causing RequireNoInjection to fail. Fix: add the allow list check at the start of injectTracers, before any mutators run. 2. Our clusterAgent.env override in the test YAML replaced the framework's default env list, losing GRADUAL_ROLLOUT_ENABLED=false. With gradual rollout enabled the injector image resolver may return an image from a different registry (not registry.datadoghq.com), causing the allow list check to block even the "allowed" pod. Fix: add GRADUAL_ROLLOUT_ENABLED=false back to the test YAML alongside our allow list env var.
…stry) The default admission_controller.container_registry is gcr.io/datadoghq, so injector and library images both come from gcr.io/datadoghq in the test environment. The allow list was set to registry.datadoghq.com which does not match, causing the allow list check to block the 'allowed' pod.
InjectAPMLibraries has only one production caller (namespace_mutator.go), which already runs the allow list check before any mutators. The check inside InjectAPMLibraries was redundant. Remove it and the unit tests that covered it at that level.
…-allow-list-webhook
…-allow-list-webhook
Without admission.datadoghq.com/enabled, the target-based path takes over and the python-lib.custom-image annotation is ignored. The pod gets injected with the target-configured image (gcr.io/datadoghq), which is in the allow list, so the test incorrectly fails. Add the enabled label so the annotation-based path processes the custom-image annotation, which the allow list then correctly blocks.
…-allow-list-webhook
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0b65fe7809
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if registry == "" { | ||
| return errors.New("image registry is not in the allow list") |
There was a problem hiding this comment.
In which cases can the registry be an empty string?
There was a problem hiding this comment.
When a custom image is provided without an explicit registry in an annotation.
For example, if a user sets:
admission.datadoghq.com/python-lib.custom-image: dd-lib-python-init:v3
There was a problem hiding this comment.
There was a problem hiding this comment.
I think that would be a breaking change. The annotation value is currently used as-is, without adding a registry
There was a problem hiding this comment.
I understand, but if we don't fix it, it means that the moment the user sets a non empty allowlist, all services having annotations not setting a registry will break. On large clusters it will be a mess to fix all of them.
Maybe we can just drop this conditional and allow the users to set "" in the allowlist so such annotations don't break ?
There was a problem hiding this comment.
In all cases, I am approving the PR, I don't think this is a blocker from our side, just raised this so we are all aware of the impact 👍
adel121
left a comment
There was a problem hiding this comment.
Left some small questions.
And one general thought (although it might have already been taken into consideration), it would be good to keep the configuration consistent across the agent and the csi driver. IMO the user should configure the allowlist in one place. Otherwise they can easily get out of sync.
I agree. Do you think any changes are needed in this PR, or is the related Helm Charts PR sufficient? https://github.com/DataDog/helm-charts/pull/2488/changes#diff-2fcec98016e714405bc936fd949012c2e72f23c20a1a94acf695a9cd3e29d75f |
Yes this goes to the helm chart and/or operator. Not in this PR for sure. |
What Does This Do
Adds a registry allow list for CSI-based APM library injection. When the CSI injection mode is used, the admission controller webhook now checks the configured registry allow list before adding CSI volumes to pods. If a library's registry is not permitted, the webhook skips the CSI volume entirely — the pod starts cleanly without blocked volumes.
This complements the CSI driver's own registry check (see DataDog/datadog-csi-driver#72). The webhook provides clean UX (no unnecessary volumes added), while the CSI driver acts as a security backstop (returns success on rejection so pods aren't blocked).
How to Test
The Helm chart (DataDog/helm-charts, branch
knusbaum/csi-driver-registry-allow-list) passesdatadog-csi-driver.apm.registryAllowListto the cluster-agent asDD_ADMISSION_CONTROLLER_AUTO_INSTRUMENTATION_CSI_REGISTRY_ALLOW_LIST.I have several
injector-devscenarios, yet to be publishedRelated PRs
Jira